23. Quiz: Interpret the Policy

A policy determines how an agent chooses an action in response to the current state. In other words, it specifies how the agent responds to situations that the environment has presented.

Consider the recycling robot MDP from the previous lesson.

## Deterministic Policy: Example

An example deterministic policy $\pi: \mathcal{S}\to\mathcal{A}$ can be specified as:

$$\pi(\text{low}) = \text{recharge}$$

$$\pi(\text{high}) = \text{search}$$

In this case,

  • if the battery level is low, the agent chooses to recharge the battery.
  • if the battery level is high, the agent chooses to search for cans.
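A deterministic policy is just a lookup from state to action, so it can be sketched as a dictionary. The snippet below is a minimal illustration, assuming states and actions are encoded as plain strings; the names `policy` and `act` are hypothetical and not part of the lesson.

```python
# Assumption: states and actions are represented as plain strings.
# Deterministic policy from the example: one action per state.
policy = {
    "low": "recharge",
    "high": "search",
}


def act(state: str) -> str:
    """Return the action the deterministic policy selects in `state`."""
    return policy[state]


for state in ("low", "high"):
    print(f"battery {state}: agent chooses {act(state)}")
```

Because the mapping holds exactly one action per state, calling `act` with the same state always returns the same action.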

Question 1

Consider a different deterministic policy $\pi: \mathcal{S}\to\mathcal{A}$, where:

$$\pi(\text{low}) = \text{search}$$

$$\pi(\text{high}) = \text{search}$$

Which of the following statements are true if the agent follows this policy? (Select all that apply.)

SOLUTION:
  • If the state is _low_, the agent chooses action _search_.
  • The agent will always _search_ for cans at every time step (whether the battery level is _low_ or _high_).

## Stochastic Policy: Example

An example stochastic policy $\pi: \mathcal{S}\times\mathcal{A}\to [0,1]$ can be specified as:

$$\pi(\text{recharge} \mid \text{low}) = 0.5$$

$$\pi(\text{wait} \mid \text{low}) = 0.4$$

$$\pi(\text{search} \mid \text{low}) = 0.1$$

$$\pi(\text{search} \mid \text{high}) = 0.9$$

$$\pi(\text{wait} \mid \text{high}) = 0.1$$

In this case,

  • if the battery level is low, the agent recharges the battery with 50% probability, waits for cans with 40% probability, and searches for cans with 10% probability.
  • if the battery level is high, the agent searches for cans with 90% probability and waits for cans with 10% probability.
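A stochastic policy assigns each state a probability distribution over actions, and the agent samples from that distribution at every step. The sketch below is one possible encoding, assuming nested dictionaries and Python's standard `random` module; the names `policy` and `sample_action` and the seed value are illustrative choices, not part of the lesson.

```python
import random

# Assumption: each state maps to a dict of {action: probability}.
policy = {
    "low": {"recharge": 0.5, "wait": 0.4, "search": 0.1},
    "high": {"search": 0.9, "wait": 0.1},
}

# The probabilities for each state must sum to 1.
for probs in policy.values():
    assert abs(sum(probs.values()) - 1.0) < 1e-9


def sample_action(state: str, rng: random.Random) -> str:
    """Draw one action according to the policy's distribution for `state`."""
    actions = list(policy[state])
    weights = list(policy[state].values())
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed only so that repeated runs match

# Count how often each action is chosen when the battery is low.
counts = {action: 0 for action in policy["low"]}
for _ in range(10_000):
    counts[sample_action("low", rng)] += 1
print(counts)
```

Over many draws, the relative frequencies in `counts` should approach 0.5, 0.4, and 0.1, although any single run will deviate somewhat from those values.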

Question 2

Consider a different stochastic policy $\pi: \mathcal{S}\times\mathcal{A}\to [0,1]$, where:

$$\pi(\text{recharge} \mid \text{low}) = 0.3$$

$$\pi(\text{wait} \mid \text{low}) = 0.5$$

$$\pi(\text{search} \mid \text{low}) = 0.2$$

$$\pi(\text{search} \mid \text{high}) = 0.6$$

$$\pi(\text{wait} \mid \text{high}) = 0.4$$

Which of the following statements are true if the agent follows this policy? (Select all that apply.)

SOLUTION:
  • If the battery level is _high_, the agent chooses to _search_ for a can with 60% probability, and otherwise _waits_ for a can.
  • If the battery level is _low_, the agent is most likely to decide to _wait_ for cans.
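Both statements in the solution can be read directly off the policy's probabilities. The following sketch checks them, assuming the same nested-dictionary encoding as a convenience; `policy_q2` and `most_likely_action` are hypothetical names introduced only for this illustration.

```python
# Assumption: each state maps to a dict of {action: probability}.
# These are the probabilities given in Question 2.
policy_q2 = {
    "low": {"recharge": 0.3, "wait": 0.5, "search": 0.2},
    "high": {"search": 0.6, "wait": 0.4},
}


def most_likely_action(state: str) -> str:
    """Return the action with the highest probability in `state`."""
    probs = policy_q2[state]
    return max(probs, key=probs.get)


# Probability of searching when the battery level is high.
print(policy_q2["high"]["search"])

# Most likely action when the battery level is low.
print(most_likely_action("low"))
```

In the high state only `search` and `wait` have nonzero probability, which is why the agent waits whenever it does not search.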